Determination of factors governing fibrillogenesis of polypeptide chains using lattice 

models 
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Using lattice models we explore the factors that determine the tendencies of polypeptide chains 
to aggregate by exhaustively sampling the sequence and conformational space. The morphologies 
of the fibril- like structures and the time scales (jfib) for their formation depend on a subtle bal- 
ance between hydrophobic and coulomb interactions. The extent of population of N*, which is a 
fibril-prone structure in the spectrum of monomer conformations, is the major determinant of r/^. 
This observation is used to determine the aggregation-prone consensus sequences by exhaustively 
exploring the sequence space. Our results provide a basis for genome wide search of fragments that 
are aggregation prone. 

PACS numbers: 87.15.A,87.14.E 



Proteins that are unrelated by sequence or structure 
aggregate to form amyloid-like fibrils with a character- 
istic cross f3- structures, which are linked to a number 
of deposition diseases such as Alzheimer's and prion- 
disorders [l| (a) . The observation that almost any protein 
could form fibrils seemed to imply that fibril rates can be 
predicted solely based on sequence composition and the 
propensity to adopt global secondary structure - a con- 
clusion that has limited validity. Despite the common 
structural characteristics of organized aggregates such as 
amyloid fibrils [l](b)-(e) the factors that determine the 
fibril formation tendencies are not understood. 

Experiments on fibril formation times (r/^) have been 
rationalized using global factors such as the hydropho- 
bicity of side chains 0(a), net charge Q(b,c), patterns 
of polar and non-polar residues [2](d), frustration in sec- 
ondary structure elements [2[(e,f), and aromatic inter- 
actions [2](g). However, the inability to sample the se- 
quences and conformational spaces exhaustively Q has 
prevented deciphering plausible general principles that 
govern protein aggregation using limited computations 
and experiments. The purpose of this letter is to obtain 
a quantitative correlation between intrinsic properties of 
polypeptide sequences and their fibril growth rates using 
the lattice models, which have given remarkable insights 
into the general principles of protein folding and aggre- 
gation [j. Using a modification of the model in [5j we 
explore the sequence-dependent variations of r/^ on the 
nature of conformations explored by the monomer. In 
particular, we highlight the role of N*, which is one of the 
aggregation prone structures [6[ in the folding landscape 
of the monomer in determining Tfib and the propensity 
of sequences to form fibrils. 

Lattice model. To explore the dependence of fibril 



formation rates on the intrinsic properties of monomers 
and the sequence, we use a lattice model [5[ in which each 
chain consists of M connected beads that are confined to 
the vertices of a cube. The simulations are done using 
TV identical chains with M = 8. The peptide sequence 
which is used to illustrate the roles of electrostatic and 
hydrophobic interaction is +HHPPHH- (Fig. [1]), where 
H, P, + and - are hydrophobic, polar, positively charged 
and negatively charged beads respectively [5[. 

The inter- and intra-chain potentials include excluded 
volume and contact (nearest neighbor) interactions. Ex- 
cluded volume is imposed by the condition that a lattice 
site can be occupied by only one bead. The energy of TV 
chains is [5[ 
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where nj is the distance between residues i and j, a is a 
lattice spacing, sm(i) indicates the type of residue i from 
ra-th peptide, and (5(0) = 1 and zero, otherwise. The 
first and second terms in Eq. [T] represent intrapeptide 
and interpeptide interactions, respectively. 

The propensity of polar (including charged) residues to 
be "solvated" is mimicked using Ep a =-0.2 (in the units 
of hydrogen bond energy e#), where a= P,+,or -. To 
assess the importance of electrostatic and hydrophobic 

interactions, we vary either in the interval — 1.4 < 

E^ < or Ehh between -1 and 0. If E^ is varied, we 

set Ehh = — 1, while if Ehh is varied, then E+_ = —1.4. 

We used E++ = E = —E+-/2 and all other contact 

interactions have E a p = 0.2. 

Monomer spectra depends on E-\ and Ehh* 
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FIG. 1: (a) Spectrum of energies and associated low energy 
structures of the monomer sequence +HHPPHH-. H, P, + 
and - are in green, yellow, blue, and red, respectively. We 
set Ehh = — 1 and = 0. There are a total 1831 possi- 
ble conformations that are spread among 17 possible energy 
values. The N* structure enclosed in the box coincides with 
the peptide state in the fibril (see Fig. [2Jl). (b) The proba- 
bility Pn* of populating the aggregation-prone structure N* 
as a function of T for E-\ = 0, —0.3, 0.6, —1 and -1.4 keep- 
ing Ehh = — 1. The arrow indicates T*, where Pn* — Pn* x - 
Dependence of P™ x on E + - for Ehh — -1 (c), and on Ehh 
for E + - = -1.4 (d). 



The spectrum of energy states of the monomer for a given 
sequence is determined by exact enumeration of all pos- 
sible conformations (Fig. [T}. For all sets of contact ener- 
gies chosen above the native state (NS) of the monomer 
is compact (lowest energy conformation in Fig. [TJi). In 
anticipation of the role N* (the structure encircled in the 
box in Fig. plays in promoting fibril formation we fo- 
cus on the change in the rank order of the N* energy level 

as is varied. For E^ < 0, N* is the first excited 

state [5]. However, if E + _ = 0, N* is one of the 19-fold 
degenerate states in the second excited state (Fig. [TJl) . 
The energy gap between the monomeric NS and state 
N*, AE = E N * - E NS = 0.4 for the sequences in Fig. [TJ 
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FIG. 2: (a) The lowest energy fibril structure for E-\ = 

— 1.4 and Ehh — — 1. (b) Same as in (a) but with E-\ — 

0. (c) Double layer structure for Ehh = —0.4 but keeping 
E+- = -1.4. (d) For E+- = -1.4 and Ehh = -0.3 the fibril 
structure is entirely altered, (e) Temperature dependence of 

Tfib for E-\ = —1.4 (circles) and E-\ = —0.6 (triangles). 

N = 6 and Ehh — — 1. Arrows show the temperatures at 
which the fibril formation is fastest. 



is a constant. 

The population of the putative fibril-prone conforma- 
tion in the monomeric state is Pn* = exp(— En*)/Z, 
where Z is the partition function is obtained by exact 
enumeration. Fig. [T]3, shows the temperature depen- 
dence of Pat* for various values of (E^ ) interaction, with 

Ehh = — 1 and other contact energies constant. Depend- 
ing on E^ , the maximum value of P/v* varies from 2% 

< P^ x < 12% (Fig. Ht). P™? x decreases to a lesser 
extent as the hydrophobic interaction grows (Fig. [T]i). 
Here we consider only Ehh < — 0.4 because the fibril-like 
structure is not the lowest-energy when Ehh > —0.4 . 

Morphology of lowest-energy structures of 
multi-chain systems depends on sequences. When 
multiple chains are present in the unit cell, aggregation is 
readily observed, and in due course they lead to ordered 
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structures. We used the Monte Carlo (MC) [5] annealing 
protocol, which allows for an exhaustive conformational 
search, to find the lowest energy conformation. For non- 
zero values of the chains adopt an antiparallel ar- 
rangement in the ordered protofilament, which ensures 
that the number of salt-bridge and hydrophobic contacts 
are maximized (Fig. [2^ and see alsoRef. |5[). If £+_ = 
then the lowest energy fibril structure has a vastly dif- 
ferent architecture even though they are assembled from 
N* (Fig. Eb). The structure in Fig. EJd, in which a pair 
of N* conformations are stacked by flipping one with re- 
spect to the other is rendered stable by maximizing the 
number of +P and -P contacts. We now set E+- = — 1.4 
and vary Ehh- For Ehh < —0.4, the fibril conforma- 
tion adopts the same shape as that shown in Fig. [2^, but 
for Ehh — — 0.4 the energetically more favorable double- 
layer structure emerges (Fig. (2^). If Ehh > —0.3, then 
the lowest-energy conformation ceases to have the fibril- 
like shape (Fig. [2]i). The close packed heterogenous 
structure is stitched together by a mixture of the NS con- 
formation and one of the second excited conformations. 
Even for this simple model a variety of lowest-energy 
structures of oligomers and protofilaments with differ- 
ent morphologies emerge, depending on a subtle balance 
between electrostatic and hydrophobic interactions. 

Dependence of Tfib on and Ehh- We use MC 

algorithm to study the kinetics of fibril assembly. Simula- 
tions were performed by enclosing N chains in a box with 
periodic boundary conditions jsj. The monomer concen- 
tration was kept « 6 mM (roughly the length of the cubic 
size is 35a for N = 10 monomers) for all systems. The fib- 
ril formation time Tfib is defined as an average of first pas- 
sage times needed to reach the fibril state with the lowest 
energy starting from initial random conformations. For 
a given value of T, we generated 50-100 MC trajectories 
to obtain reliable estimates of r/^. MC moves include 
global and local ones. A local move [7[ corresponds to 
tail rotation, corner flip, and crankshaft rotation. Global 
moves correspond to either translation of a peptide by a 
in a randomly chosen direction or rotation by 90° around 
one of the randomly chosen coordinate axes. The accep- 
tance probabilities of global and local moves are 0.1 and 
0.9, respectively [5[. We measure time in units of Monte 
Carlo steps (MCS). The combination of local and global 
moves constitutes one MCS. 

The temperature dependence of Tfib displays a U-shape 
(Fig. [2k) and the fastest assembly occurs at T m ^ n , which 
roughly coincides with the temperature, T*, where P/v* 
reaches maximum (Fig.^). To probe the correlation be- 
tween Tfib and E+- and Ehh we performed simulations 
at Train. The dependence of Tfib on E+_ can be fit using 

Tfib ~ exp[— c(— E-\ ) a ] where a ~ 0.6 and the constant 

c w 7.12 and 9.23 for N = 6 and 10 respectively (Fig. EH)- 

Thus, variation of E-\ drastically changes not only the 

morphology of the ordered protofilament (Fig. [2]), but 
also Tfib. As the strength of the charge interaction be- 
tween the terminal beads increases, the faster is the fibril 
formation process. Interestingly, the fibril formation rate 
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FIG. 3: (a) Dependence of Tfib on for N = 6 (circles) 

and N = 10 (triangles) with Ehh = — 1. The solid curves 
are fits to y = Co + c(— x) a , where a ~ 0.59. Co — 21.32 
and c -7.12 and c 25.14 and c -9.23 for N = 6 
and 10, respectively, (b) Dependence of 77^ on Ehh with 

E^ = — 1.4 hold constant for N = 6 (solid circle) and N = 

10 (solid triangles). Lines are fits y — 19.17 + 7.97x and 
y = 22.69 + 8.56x for N = 6 and 10, respectively. For N = 
6 the first point Ehh = — 1 is excluded from fitting, (c) 
Dependence of Tfib on P™ x for N = 6 and 10. Symbols are 
the same as in (a) and (b) Tfib is measured in MCS and P^* x 
in %. The correlation coefficient for all fits R ~ 0.98. 



at E^ = is about four orders of magnitude slower than 

that at E-i = —1.4. Our model shows that, in agree- 
ment with previous studies [l](f), the propensity to fibril 
assembly strongly depends on the charge states of the 
polypeptide sequences. 

By fixing E^ = —1.4 we calculated the dependence 

of Tfib on the hydrophobic interaction (Fig. EJb), which 
may be approximated using Tfib ~ ^p(cEhh)- Here 
constant c ~ 7.97 and 8.56 for N = 6 and 10, respectively. 
For N = 10, a change in hydrophobicity of AEhh = 
0.6, leads to self-assembly rates that are more than two 
orders of magnitude. In accord with experiments [H, 
our finding shows that the enhancement of hydrophobic 
interactions speeds up fibril formation rates. 

Fibril formation rates depend on P/v*. The dra- 
matic variations in Tfib on E^ and Ehh prompted us to 

link the underlying spectrum of monomer conformations 
to Tf^. To establish such a relation we collected all the 
data in Fig. [3Jl and [3}d and plot them as a function of 
pmax ^p-g [3^ ^ e f oun( ^ the surprising relation 

T fib =T% b exp(-cP%n, (2) 
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where the prefactor r*} ib « 1.014 x 10 10 MCS and 
3.981 xlO 11 MCS, and c « 0.9 and 1.0, for N = 6 and 
10, respectively. There are a few implications of the cen- 
tral result given in Eq. O (i) The sequence-dependent 
spectrum of the monomer is a harbinger of fibril forma- 
tion. In proteins there are multiple N* conformations 
corresponding to distinct free energy basins of attrac- 
tion [6](b). Aggregation from each of the structures in 
the various basins of attraction could lead to fibrils with 
different morphologies (polymorphism) that cannot be 
captured using lattice models, (ii) Enhancement of P/v* 
either by mutation or chemical cross linking should in- 
crease fibril formation rates. Indeed, a recent experi- 
ment Q showed that the aggregation rate of A/3i_4o- 
lactam[D23-K28], in which the residues D23 and K28 are 
chemically constrained by a lactam bridge, is nearly a 
1000 times greater than in the wild-type. Since the salt 
bridge constraint increases the population of the N* con- 
formation in the monomeric state [10]. It follows from 
Eq. [2j Tfib should decrease, (hi) Since P/v* (T) depends 
on the spectrum of the precise sequence, it follows that 
the entire free energy landscape of the monomer @(b) 
and not merely the sequence composition as ascertained 
else where[l](f), should be considered in the predictions 
of the amyloidogenic tendencies of a particular sequence. 

Scanning the sequence space. We further exploit 
the result in Eq.[2]to determine the amylome the uni- 
verse of sequences in the lattice model, that can form fib- 
rils. We posit that aggregation prone sequences are those 
with a unique native state with a maximum in P/v* (T) 
in the interval 1.0 < T*/T F < 1.25. Out of the 65,536 



sequences only 217 satisfy these criteria (see Supplemen- 
tary Information (SI) for details). The sequence space 
exploration shows that there is a high degree of corre- 
lation between the positions of charged and hydropho- 
bic residues leading to a limited number of aggregation 
prone sequences with +HHPPHH- being an example. In 
addition, there are substantial variations in T*/Tp for 
sequences with identical sequence composition, which re- 
inforces the recent finding [11] that context in which 
charged and hydrophobic residues are found is important 
in the tendency to form amyloid-like fibrils. 

We have shown that there is a strong correlation be- 
tween the extent of population of N* structures (in pro- 
teins we expect multiple aggregation prone conforma- 
tions), which depend on the exact sequence and the 
environment. Interestingly, P^ x , which roughly anti- 
correlates with A(= [Ens — En*]/Ens) (see Fig. 1 in 
SI) for aggregation prone sequences but correlates with 
A (or equivalently Z-score 0(a)) for foldable sequences. 
Due to the strong dependence of t/^ on P™ x we suggest 
that only limited number of sequences are aggregation 
prone (see SI for details). Although the conclusions were 
obtained using lattice models they should hold for pep- 
tide aggregation. Our study provides a basis for genome 
wide search for consensus sequences with propensity to 
aggregate. 
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Supplementary Information: Determination of factors governing fibrillogenesis of polypeptide chains using 

lattice models 

For the lattice model with M beads and four types of residues (H, P, -f , -), there are 4 M sequences. Out of the 65536 

(for M = 8) sequences 5950 have a unique ground state. Among the 65,536 sequences roughly half are mirror images 

of each other. Although, in reality these would be distinct sequences, in the lattice model they would yield identical 

thermodynamic properties. Since these sequences are redundant, we do not include them in the statistical analysis. 

To determine the amylome [1] (space of sequences in the lattice model that are amyloidogenic) we first determined 

the folding temperature for the 5950 sequences using the condition P]\rs(Tf) = 0.5, where PNs(Tf) = e~ l3 f ENS / Z 

v 

where j3f = l/fc^Xy, Ens is the energy of the ground state, and Z = ^^e~^ Ei ; Z can be exactly evaluated for each 

1=1 

sequence because for M = 8 the number of conformations, p, that satisfy self- avoidance is only 1831. The energies 
Ei for all the conformations are computed using the chosen interaction energies between the near neighbors beads in 
the cubic lattice Q. 

Because of the key role that the hairpin-like structure (N*) plays in the formation of fibril-like structures we 
analyzed, the 5950 sequences with unique ground state (NS) to determine their properties. We determined the 
temperature T*, where the probability of finding the chain in the N* conformation P/v*(T), is a maximum. We 
classify those sequences, which satisfy the condition, 1.0 < T*/Tf < 1.25 as amyloidogenic. If T* exceeds 1.257/ the 
probability of accessing N* is negligible, and such sequences are unlikely to aggregate in finite time scale. Even at 
T* = 1.257/, many sequences have the maximum in P/v*(T), Pff? x < 1%. In our lattice model there are only 217 
such sequences, i.e. only 217 (0.66%) sequences that are aggregation prone. In what follows, we analyze a number of 
properties of the 217 sequences in order to provide insights into the tendency of natural sequences to be amyloidogenic. 

4 

Sequence Entropy. The sequence entropy for the 217 sequences is calculated using Sk = — ^^Pki ln(P^), where 

2=1 

Phi is the probability of finding the residue of type i in position k. If Phi = 1/4 (uniform probability) for all i then 
S k = In 4 = 1.386 independent of k. We find that S k = 1.106,1.260,1.329,1.286,1.336,1.318,1.212 and 1.064 for 
k = 1, 2, 8 respectively. The terminal positions have a slightly lower entropy than the positions in the middle where 
the hairpin-like bend is formed. Because there is no position that is strongly conserved we surmise that sequence 
alone cannot be a good predictor of the tendency of a sequence to aggregate. 

Dependence of P^ x on A(= [Ens — En*]/ Ens)- There is striking anit-correlation between P^ x and A 
(Fig. which establishes that in this model the extent of population of the N* conformation is the major determine 
of the propensity to aggregate. Interestingly for foldable sequences the Z-score (related to A in realistic models) is 
large. These considerations explain how natural sequences may have evolved to maximize Z-score so that unneeded 
aggregation is avoided. 

Sequence composition. In the 217 sequences, there are 54 different sequence compositions. We find there are 
different T*/Tf values for identical composition, which implies that sequence composition alone cannot be a good 
predictor of the propensity to aggregate [lj. 

Correlation between the type of beads at different positions is significant. There is a high degree of 
correlation between the nature of beads at various positions (see Tables: Ul llllj) . For example, oppositely charged 
residues are most likely to be found at positions 1 and 8 (Table: [I]). Similarly, if a H bead is in position 2, then with a 
unit probability, a H bead is found in position 7 (Table: (TTJ) . A high preference (0.92) is found for H beads to occupy 
positions 3 and 6 (Table: HI!]). The correlation between the type of beads is not relevant at positions 4 and 5, which 
are the hairpin bend positions (Table. IIV|) . Based on the results in Tables: Hl lIVI we find that +HHPPHH- is one of 
the sequences with high probability to aggregate (substantial PJ^^). This sequence has N* as the first excited state, 
T*/Tf = 1.169 and A = [E NS - E N *]/E N s = 0.105. We also find other sequences with low T*/T f values. Three 
such sequences, which can form ordered fibrils such as the ones shown in Fig. 2a are ++HPPH- -, +-HPPH+- and 
+H-PP+H- with identical A(=0.10) and T*/T / (=1.21). 
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FIG. 4: Anti-correlation of ln(P^* ax ) on A(= [E NS - E N *]/E NS ). The straight line \n(P^ ax ) = -7.15A + 2.69 is a fit to the 
data and has a correlation of 0.9. 



TABLE I: Bead correlation between positions 1 and 8 



Residue in position- 1 


Probability of finding the residues below in position - 8 


+ 




H 


p 


+ 


0.0 


0.861 


0.0 


0.139 




0.861 


0.0 


0.0 


0.139 


H 


0.0 


0.0 


1.0 


0.0 


P 


0.313 


0.313 


0.0 


0.374 



TABLE II: Bead correlation between positions 2 and 7 



Residue in position-2 


Probability of finding the residues below in position - 7 


+ 




H 


p 


+ 


0.0 


0.650 


0.0 


0.350 




0.650 


0.0 


0.0 


0.350 


H 


0.0 


0.0 


1.0 


0.0 


P 


0.389 


0.389 


0.0 


0.222 



TABLE III: Bead correlation between positions 3 and 6 



Residue in position-3 


Probability of finding the residues below in position - 6 


+ 




H 


P 


+ 


0.0 


0.817 


0.017 


0.166 




0.817 


0.0 


0.017 


0.166 


H 


0.027 


0.027 


0.920 


0.266 


P 


0.380 


0.380 


0.040 


0.200 



TABLE IV: Bead correlation between positions 4 and 5 



Residue in position-4 


Probability of finding the residues below in position - 5 


+ 




H 


P 


+ 


0.049 


0.124 


0.444 


0.383 




0.124 


0.049 


0.444 


0.383 


H 


0.336 


0.336 


0.085 


0.243 


P 


0.263 


0.263 


0.22 


0.254 



